About Dataset

Problem Statement:

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems.

Attribute Information:

Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarised below:

  1. Mean of the integrated profile.
  2. Standard deviation of the integrated profile.
  3. Excess kurtosis of the integrated profile.
  4. Skewness of the integrated profile.
  5. Mean of the DM-SNR curve.
  6. Standard deviation of the DM-SNR curve.
  7. Excess kurtosis of the DM-SNR curve.
  8. Skewness of the DM-SNR curve.
  9. Class

Task

Given a candidate star's measurements, build a machine learning model that classifies whether it is a pulsar.

Importing the libraries
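The original import cell is not shown, so the exact set below is an assumption based on the tools used later in the notebook (pandas, scikit-learn, and plotting libraries):

```python
# Assumed imports for this notebook; the original cell is not shown.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so plots work in scripted runs
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```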

Reading the data

Exploring the data
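The per-feature observations below come from a summary table such as `df.describe()`. A minimal sketch on a toy stand-in column (the column name matches the dataset, but the values here are synthetic):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one pulsar feature; real values come from the HTRU2 CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Mean of the integrated profile": rng.normal(100, 25, 1000)})

# count, mean, std, min, 25%, 50%, 75%, max -- the quantities discussed below
summary = df.describe()
print(summary)
```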

Integrated Profile

This distribution appears approximately normal: it is symmetric about the mean, with tails of regular thickness.

  1. Mean : The mean is close to the 50% mark, and the 25% and 75% quantiles lie within one standard deviation of it. The gap between the min and max is large, hence a larger standard deviation. So the mean of the integrated profile is volatile, with a large standard deviation.

  2. Std. Dev : The mean is close to the 50% mark, and the 25% and 75% quantiles lie within less than one standard deviation of it, so this feature is roughly normally distributed. The max and min values lie 3-4 standard deviations from the mean.

  3. Excess Kurtosis : At least 75% of the values are below the mean, so the distribution has a large head: the region to the left of the mean is more tightly packed than the region to the right. Since most values are small, the integrated profile's tails are generally about the same size as a normal distribution's.

  4. Skewness : Well over 75% of the values are below the mean, so again the distribution has a large head, with the region left of the mean more tightly packed than the right. Since most values are small, the integrated profile must not be very skewed.

DM-SNR Curve

This distribution is expected to be more spread out than a normal distribution, and skewed to the right of the mean.

  1. Mean : The standard deviation of this feature is very high, and more than 75% of the values are below its mean. Hence the mean of most DM-SNR curves is small.

  2. Std. Dev : Very skewed towards the higher side; most values have a standard deviation below 28. However, compared with the distribution of the DM-SNR mean, which is mostly below 5, this indicates a very highly spread DM-SNR curve.

  3. Excess Kurtosis : This feature is roughly normally distributed, since the mean equals the median and each quartile spans about one standard deviation. However, the values themselves are high, so the DM-SNR curve has fatter tails than a normal distribution.

  4. Skewness : The values and their standard deviation are both very high, so the DM-SNR curve is very skewed, as expected.

Check whether our data is balanced
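A quick way to check the class balance is `value_counts()` on the target column. The sketch below uses a synthetic stand-in series with roughly the HTRU2 proportions (about 9% pulsars); the real notebook would use `df["Class"]`:

```python
import pandas as pd

# Stand-in target column: HTRU2 is heavily imbalanced (~91% non-pulsars).
target = pd.Series([0] * 910 + [1] * 90, name="Class")

counts = target.value_counts()        # samples per class
ratio = counts[1] / counts.sum()      # fraction of positive (pulsar) samples
print(counts)
print(f"positive-class fraction: {ratio:.2%}")
```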

Visualizing the data

Mean of the integrated profile & Mean of the DM-SNR curve

Standard deviation of the integrated profile & Standard deviation of the DM-SNR curve

Excess kurtosis of the integrated profile & Excess kurtosis of the DM-SNR curve

Skewness of the integrated profile & Skewness of the DM-SNR curve


Correlation Heatmap

Correlations between two independent variables

Highly positively correlated:

  1. Skewness of the integrated profile and Excess kurtosis of the integrated profile
  2. Skewness of the DM-SNR curve and Excess kurtosis of the DM-SNR curve
  3. Mean of the DM-SNR curve and Standard Deviation of the DM-SNR curve

Highly negatively correlated:

  1. Mean of the integrated profile and Excess kurtosis of the integrated profile
  2. Mean of the integrated profile and Skewness of the integrated profile
  3. Excess kurtosis of the DM-SNR curve and Standard Deviation of the DM-SNR curve

Correlations between the independent variables and the dependent variable

Highly positively correlated:

  1. Excess kurtosis of the integrated profile
  2. Skewness of the integrated profile

Highly negatively correlated:

  1. Mean of the integrated profile
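The correlations above come from the feature correlation matrix. A minimal sketch on synthetic stand-in columns (the column names match the dataset, but the values and the construction are assumptions made so the pairwise relationships are visible):

```python
import numpy as np
import pandas as pd

# Synthetic columns engineered to mimic the correlations found above.
rng = np.random.default_rng(0)
kurt = rng.normal(size=500)
df = pd.DataFrame({
    "Excess kurtosis of the integrated profile": kurt,
    "Skewness of the integrated profile": 2 * kurt + rng.normal(scale=0.1, size=500),
    "Mean of the integrated profile": -kurt + rng.normal(scale=0.1, size=500),
})

corr = df.corr()
# The heatmap itself can then be drawn with: sns.heatmap(corr, annot=True)
print(corr.round(2))
```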

Detect outliers in every column

Since there are not many outliers, we can either remove them or cap them. Removing them is not advised here, so we will cap them using the IQR rule.
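IQR capping clips each column to the fences [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch (the helper name and the toy series are illustrative, not from the original notebook):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier
capped = cap_outliers_iqr(s)          # 100 is pulled down to the upper fence
print(capped.tolist())
```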

Visualize the outliers before and after capping

Scaling and splitting the data

Split

Scaling

In the presence of outliers, StandardScaler does not guarantee balanced feature scales, because the outliers influence the empirical mean and standard deviation. This leads to shrinkage in the range of the feature values.
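A sketch of the split-then-scale step. It uses `make_classification` as a synthetic stand-in for the 8 pulsar features (an assumption: the real notebook reads them from the HTRU2 CSV); note the scaler is fitted on the training split only, to avoid leaking test-set statistics:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features, ~90/10 class imbalance like HTRU2.
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only
X_test_s = scaler.transform(X_test)        # reuse train statistics
```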

Balancing the data

Why do we have to balance the data?

The answer is quite simple: to make our predictions more accurate.

With imbalanced data, the model is biased towards the dominant target class and tends to predict every sample as that predominant class.

Techniques for handling imbalanced data

There are many ways to handle imbalanced data. Here we will look at the techniques below, along with their code implementation.

Oversampling


RandomOverSampler

SMOTETomek


Oversampling pros and cons

Pros:

  1. No information is lost, since all majority-class samples are kept.

Cons:

  1. Duplicating or synthesizing minority samples increases the risk of overfitting and makes the training set larger.


UnderSampling


NearMiss method

RandomUnderSampler method


Undersampling pros and cons

Pros:

  1. The training set shrinks, so training becomes faster and cheaper.

Cons:

  1. Potentially useful majority-class samples are discarded, which can lose information.

When to use oversampling vs. undersampling

Both methods address the class-imbalance issue; which one to use depends on the data. As a rule of thumb, undersampling suits large datasets where discarding majority samples is affordable, while oversampling suits smaller datasets where every sample matters.


So we will use oversampling with the SMOTETomek method.


Modeling

What are the Performance Evaluation Measures for Classification Models?
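The standard measures for a binary classifier are accuracy, precision, recall, F1, ROC-AUC, and the confusion matrix. A sketch of computing them with scikit-learn on a synthetic stand-in dataset (the model choice here is just a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the pulsar data.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
prec = precision_score(y_te, pred)
rec = recall_score(y_te, pred)
f1 = f1_score(y_te, pred)
cm = confusion_matrix(y_te, pred)  # rows: true class, cols: predicted class
print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```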

Logistic Regression

Logistic regression is an extension of linear regression used for classification tasks: the output variable is binary (e.g., only black or white) rather than continuous (e.g., an infinite list of potential colors).
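A minimal sketch on synthetic stand-in data; `predict_proba` exposes the sigmoid-derived class probabilities that make logistic regression a classifier rather than a regressor:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled pulsar features.
X, y = make_classification(n_samples=500, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:5])  # per row: [P(class 0), P(class 1)]
print(proba.round(3))
```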

Decision Tree

A decision tree is a tree-like graph in which sorting proceeds from the root node down to a leaf node until the target is reached. It is one of the most popular supervised algorithms for decision-making and classification. It is constructed by recursive partitioning: each node acts as a test case for some attribute, and each edge leaving the node is a possible answer to that test case.
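A minimal sketch on synthetic stand-in data; `max_depth` bounds the recursive partitioning described above (the value 4 is an illustrative choice, not the notebook's tuned setting):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pulsar features.
X, y = make_classification(n_samples=500, random_state=0)

# max_depth limits how many recursive splits (node test cases) are allowed.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print("depth:", tree.get_depth(), "train accuracy:", tree.score(X, y))
```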

Random Forest

A random forest is a classification or regression model that improves on a single decision tree by building many decision trees on various subsets of the dataset and aggregating their outputs: a majority vote for classification (a discrete variable, e.g., black, white, or red) and an average for regression (a continuous variable, e.g., age). The algorithm is simple to use, effective, and can predict with high accuracy, which is why it is so popular.
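A minimal sketch on synthetic stand-in data; `n_estimators` is the number of trees whose votes are aggregated (100 is scikit-learn's default, used here for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pulsar features.
X, y = make_classification(n_samples=500, random_state=0)

# 100 trees, each trained on a bootstrap subset; prediction = majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("trees:", len(rf.estimators_), "train accuracy:", rf.score(X, y))
```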

KNN

K Nearest Neighbours is a basic algorithm that stores all the available data and predicts the classification of unlabelled data based on a similarity measure. In plane geometry, when two points are plotted on the 2D Cartesian system, we identify the similarity measure by calculating the distance between them. The same applies here: KNN works on the assumption that similar things exist in close proximity; put simply, similar things stay close to each other.
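A minimal sketch on synthetic stand-in data. Because KNN is distance-based, the features are scaled first so no single feature dominates the distance (k=5 is scikit-learn's default, used here for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pulsar features.
X, y = make_classification(n_samples=500, random_state=0)
Xs = StandardScaler().fit_transform(X)  # scale: KNN relies on distances

# Predict each point's class from its 5 nearest stored neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(Xs, y)
pred = knn.predict(Xs[:3])
print(pred)
```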

XGBoost

XGBoost supports both tree-based and linear model learning and can run parallel computations on a single machine, which makes it roughly ten times faster than earlier gradient boosting implementations. It has become a state-of-the-art machine learning algorithm for structured data.

Voting

Voting classifiers are ensembles of many classifiers: we aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. A voting classifier often achieves higher accuracy than the best individual classifier in the ensemble.

Types:

Hard Voting

predict_proba is not available when voting='hard'

Soft Voting
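A sketch of both voting modes on synthetic stand-in data. Hard voting takes the majority class label, while soft voting averages the members' `predict_proba` outputs, which is why `predict_proba` is unavailable in hard mode (the three member models here are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pulsar features.
X, y = make_classification(n_samples=500, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority vote
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # averaged probas

hard_pred = hard.predict(X[:2])
soft_proba = soft.predict_proba(X[:2])  # only available with voting="soft"
```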

Comparison between models